71 research outputs found

    State of the art in selection of variables and functional forms in multivariable analysis-outstanding issues

    Background: How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, many outstanding issues in multivariable modelling remain. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics. Methods: We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling. Results: Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research. Conclusions: Selection of variables and of functional forms are important topics in multivariable analysis. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, further comparative research is required

    Stepwise classification of cancer samples using clinical and molecular data

    Background: Combining clinical and molecular data types may improve the prediction accuracy of a classifier. However, there is currently a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages. First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and cost-inefficient. Results: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to the two data types independently, starting with the traditional clinical risk factors. We turn to the relatively expensive molecular data only when the uncertainty of the prediction from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that need to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. Conclusions: Our method yields a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may reduce waiting times for diagnosis and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM, available at the Bioconductor website.
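    The two-stage decision rule described in this abstract can be illustrated with a minimal Python sketch. It is not the stepwiseCM implementation: the function name, the uncertainty measure (distance of a predicted probability from 0.5) and the default threshold are all assumptions chosen for illustration, and the abstract does not say how the package quantifies uncertainty.

```python
from typing import Callable, List, Sequence, Tuple


def stepwise_classify(
    clinical_prob: Sequence[float],
    molecular_prob: Callable[[int], float],
    max_uncertainty: float = 0.3,
) -> List[Tuple[int, str]]:
    """Return a (label, data source) pair for every sample.

    clinical_prob[i] is a hypothetical clinical classifier's estimated
    probability that sample i belongs to class 1. molecular_prob(i) stands
    in for the expensive molecular test and is called only when the
    clinical prediction is too uncertain.
    """
    results = []
    for i, p in enumerate(clinical_prob):
        # Assumed uncertainty measure: 0 when p is 0 or 1, 1 when p is 0.5.
        uncertainty = 1.0 - abs(2.0 * p - 1.0)
        if uncertainty > max_uncertainty:
            # Clinical data are not decisive: re-classify with molecular data.
            p = molecular_prob(i)
            source = "molecular"
        else:
            source = "clinical"
        results.append((int(p >= 0.5), source))
    return results


# Toy usage with made-up probabilities; samples 1 and 3 are uncertain.
molecular_lookup = {1: 0.9, 3: 0.2}
decisions = stepwise_classify([0.05, 0.55, 0.95, 0.4], molecular_lookup.__getitem__)
```

    Raising `max_uncertainty` sends fewer samples to the molecular stage, which is the cost trade-off the abstract refers to.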

    Gene expression of PMP22 is an independent prognostic factor for disease-free and overall survival in breast cancer patients

    Background: Peripheral myelin protein 22 (PMP22) and the epithelial membrane proteins (EMPs) were found to be differentially expressed in invasive and non-invasive breast cell lines in a previous study. We want to evaluate the prognostic impact of the expression of these genes in breast cancer. Methods: In a retrospective multicenter study, gene expression of PMP22 and the EMPs was measured in 249 primary breast tumors by real-time PCR. Results were statistically analyzed together with clinical data. Results: In univariable Cox regression analyses, PMP22 and the EMPs were not associated with disease-free survival or tumor-related mortality. However, multivariable Cox regression revealed that patients with higher than median PMP22 gene expression have a 3.47 times higher risk of dying of cancer compared to patients with equal values on clinical covariables but lower PMP22 expression. They also have a 1.77 times higher risk of relapse than those with lower PMP22 expression. The proportion of explained variation in overall survival due to PMP22 gene expression was 6.5%, and thus PMP22 contributes as much to the prognosis of overall survival as nodal status and estrogen receptor status. Cross-validation demonstrates that 5-year survival rates can be refined by incorporating PMP22 into the prediction model. Conclusions: PMP22 gene expression is a novel independent prognostic factor for disease-free survival and overall survival in breast cancer patients. Including it in a model with established prognostic factors will increase the accuracy of prognosis.

    Intrinsic bias in breast cancer gene expression data sets

    Background: While global breast cancer gene expression data sets have considerable commonality in terms of their data content, the populations that they represent and the data collection methods utilized can be quite disparate. We sought to assess the extent and consequence of these systematic differences with respect to identifying clinically significant prognostic groups. Methods: We ascertained how effectively unsupervised clustering employing randomly generated sets of genes could segregate tumors into prognostic groups using four well-characterized breast cancer data sets. Results: Using a common set of 5,000 randomly generated lists (70 genes/list), the percentages of clusters with significant differences in metastasis latencies (HR p-value < 0.01) were 62%, 15%, 21% and 0% in the NKI2 (Netherlands Cancer Institute), Wang, TRANSBIG and KJX64/KJ125 data sets, respectively. Among ER positive tumors, the percentages were 38%, 11%, 4% and 0%, respectively. Few random lists were predictive among ER negative tumors in any data set. Clustering was associated with ER status and, after globally adjusting for the effects of ER-α gene expression, the percentages were 25%, 33%, 1% and 0%, respectively. The impact of adjusting for ER status depended on the extent of confounding between ER-α gene expression and markers of proliferation. Conclusion: A statistically significant association between a given gene list and prognosis is highly likely to be identified in the NKI2 data set due to its large sample size and the interrelationship between ER-α expression and markers of proliferation. In most respects, the TRANSBIG data set generated outcomes similar to those of the NKI2 data set, although its smaller sample size led to fewer statistically significant results.

    Survival prediction from clinico-genomic models - a comparative study

    Background: Survival prediction from high-dimensional genomic data is an active field in today's medical research. Most of the proposed prediction methods make use of genomic data alone without considering established clinical covariates that are often available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions, but there is a lack of systematic studies on the topic. Also, for the widely used Cox regression model, it is not obvious how to handle such combined models. Results: We propose a way to combine classical clinical covariates with genomic data in a clinico-genomic prediction model based on the Cox regression model. The prediction model is obtained by using both types of covariates simultaneously, but applying dimension reduction only to the high-dimensional genomic variables. We describe how this can be done for seven well-known prediction methods: variable selection, unsupervised and supervised principal components regression and partial least squares regression, ridge regression, and the lasso. We further perform a systematic comparison of the performance of prediction models using clinical covariates only, genomic data only, or a combination of the two. The comparison is done using three survival data sets containing both clinical information and microarray gene expression data. Matlab code for the clinico-genomic prediction methods is available at http://www.med.uio.no/imb/stat/bmms/software/clinico-genomic/. Conclusions: Based on our three data sets, the comparison shows that established clinical covariates will often lead to better predictions than what can be obtained from genomic data alone. In the cases where the genomic models are better than the clinical, ridge regression is used for dimension reduction. We also find that the clinico-genomic models tend to outperform the models based on only genomic data. Further, clinico-genomic models that use ridge regression give better predictions for all three data sets than models based on the clinical covariates alone.
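    For the ridge case, the idea of using both covariate types while regularizing only the genomic part can be written as a penalized Cox partial log-likelihood. The notation below is assumed for illustration and is not taken from the paper: z_i are the clinical covariates with unpenalized coefficients β, x_i are the p genomic covariates with coefficients γ, δ_i is the event indicator, R(t_i) is the risk set at event time t_i, and λ ≥ 0 is a tuning parameter. The paper's exact formulation may differ.

```latex
\eta_i = z_i^\top \beta + x_i^\top \gamma, \qquad
\ell(\beta, \gamma) = \sum_{i:\,\delta_i = 1}
  \Big[ \eta_i - \log \sum_{j \in R(t_i)} \exp(\eta_j) \Big], \qquad
(\hat\beta, \hat\gamma) = \arg\max_{\beta,\, \gamma}
  \Big\{ \ell(\beta, \gamma) - \lambda \sum_{k=1}^{p} \gamma_k^2 \Big\}
```

    Because the penalty involves only γ, the clinical coefficients are estimated without shrinkage while the genomic coefficients are shrunk toward zero.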

    Stromal Genes Add Prognostic Information to Proliferation and Histoclinical Markers: A Basis for the Next Generation of Breast Cancer Gene Signatures

    BACKGROUND: First-generation gene signatures that identify breast cancer patients at risk of recurrence are confined to estrogen-positive cases and are driven by genes involved in the cell cycle and proliferation. Previously we induced sets of stromal genes that are prognostic for both estrogen-positive and estrogen-negative samples. Creating risk-management tools that incorporate these stromal signatures, along with existing proliferation-based signatures and established clinicopathological measures such as lymph node status and tumor size, should better identify women at greatest risk for metastasis and death. METHODOLOGY/PRINCIPAL FINDINGS: To investigate the strength and independence of the stromal and proliferation factors in estrogen-positive and estrogen-negative patients we constructed multivariate Cox proportional hazards models along with tree-based partitions of cancer cases for four breast cancer cohorts. Two sets of stromal genes, one consisting of DCN and FBLN1, and the other containing LAMA2, add substantial prognostic value to the proliferation signal and to clinical measures. For estrogen receptor-positive patients, the stromal-decorin set adds prognostic value independent of proliferation for three of the four datasets. For estrogen receptor-negative patients, the stromal-laminin set significantly adds prognostic value in two datasets, and marginally in a third. The stromal sets are most prognostic for the unselected population studies and may depend on the age distribution of the cohorts. CONCLUSION: The addition of stromal genes would measurably improve the performance of proliferation-based first-generation gene signatures, especially for older women. Incorporating indicators of the state of stromal cell types would mark a conceptual shift from epithelial-centric risk assessment to assessment based on the multiple cell types in the cancer-altered tissue

    Do Two Machine-Learning Based Prognostic Signatures for Breast Cancer Capture the Same Biological Processes?

    It is well documented that there is very little, if any, overlap between the genes of different prognostic signatures for early-discovery breast cancer. This apparent discrepancy has been attributed to the limits of simple machine-learning identification and ranking techniques, and the biological relevance and meaning of the prognostic gene lists have been questioned. Subsequently, proponents of the prognostic gene lists claimed that different lists do capture similar underlying biological processes and pathways. The present study scrutinizes the validity of this claim for two important gene lists that are the focus of current large-scale validation efforts. We performed careful enrichment analysis, controlling the effects of multiple testing in a manner that takes into account the nested dependent structure of gene ontologies. In contradiction to several previous publications, we find that the only biological process or pathway for which statistically significant concordance can be claimed is cell proliferation, a process whose relevance and prognostic value were well known long before gene expression profiling. We found that the claims reported by others, of wider concordance between the biological processes captured by the two prognostic signatures studied, either lacked statistical rigor or were in fact based on addressing some other question.

    Breast cancer prognostic classification in the molecular era: the role of histological grade

    Breast cancer is a heterogeneous disease with varied morphological appearances, molecular features, behavior, and response to therapy. Current routine clinical management of breast cancer relies on the availability of robust clinical and pathological prognostic and predictive factors to support clinical and patient decision making in which potentially suitable treatment options are increasingly available. One of the best-established prognostic factors in breast cancer is histological grade, which represents the morphological assessment of tumor biological characteristics and has been shown to be able to generate important information related to the clinical behavior of breast cancers. Genome-wide microarray-based expression profiling studies have unraveled several characteristics of breast cancer biology and have provided further evidence that the biological features captured by histological grade are important in determining tumor behavior. Also, expression profiling studies have generated clinically useful data that have significantly improved our understanding of the biology of breast cancer, and these studies are undergoing evaluation as improved prognostic and predictive tools in clinical practice. Clinical acceptance of these molecular assays will require them to be more than expensive surrogates of established traditional factors such as histological grade. It is essential that they provide additional prognostic or predictive information above and beyond that offered by current parameters. Here, we present an analysis of the validity of histological grade as a prognostic factor and a consensus view on the significance of histological grade and its role in breast cancer classification and staging systems in this era of emerging clinical use of molecular classifiers. © 2010 BioMed Central Ltd.

    Nottingham Prognostic Index Plus (NPI+): a modern clinical decision making tool in breast cancer

    Background: Current management of breast cancer (BC) relies on risk stratification based on well-defined clinicopathologic factors. Global gene expression profiling studies have demonstrated that BC comprises distinct molecular classes with clinical relevance. In this study, we hypothesised that molecular features of BC are a key driver of tumour behaviour and when coupled with a novel and bespoke application of established clinicopathologic prognostic variables can predict both clinical outcome and relevant therapeutic options more accurately than existing methods. Methods: In the current study, a comprehensive panel of biomarkers with relevance to BC was applied to a large and well-characterised series of BC, using immunohistochemistry and different multivariate clustering techniques, to identify the key molecular classes. Subsequently, each class was further stratified using a set of well-defined prognostic clinicopathologic variables. These variables were combined in formulae to prognostically stratify different molecular classes, collectively known as the Nottingham Prognostic Index Plus (NPI+). The NPI+ was then used to predict outcome in the different molecular classes. Results: Seven core molecular classes were identified using a selective panel of 10 biomarkers. Incorporation of clinicopathologic variables in a second-stage analysis resulted in identification of distinct prognostic groups within each molecular class (NPI+). Outcome analysis showed that using the bespoke NPI formulae for each biological BC class provides improved patient outcome stratification superior to the traditional NPI. Conclusion: This study provides proof-of-principle evidence for the use of NPI+ in supporting improved individualised clinical decision making